Surrogate Key
   HOME

TheInfoList



OR:

A surrogate key (or synthetic key, pseudokey, entity identifier, factless key, or technical key) in a
database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases sp ...
is a unique identifier for either an ''entity'' in the modeled world or an ''object'' in the database. The surrogate key is ''not'' derived from application data, unlike a ''natural'' (or ''business'') key.


Definition

There are at least two definitions of a surrogate: ; Surrogate (1) – Hall, Owlett and Todd (1976): A surrogate represents an ''entity'' in the outside world. The surrogate is internally generated by the system but is nevertheless visible to the user or application. ; Surrogate (2) – Wieringa and De Jonge (1991): A surrogate represents an ''object'' in the database itself. The surrogate is internally generated by the system and is invisible to the user or application. The ''Surrogate (1)'' definition relates to a
data model A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities. For instance, a data model may specify that the data element representing a car be co ...
rather than a storage model and is used throughout this article. See Date (1998). An important distinction between a surrogate and a primary key depends on whether the database is a current database or a
temporal database A temporal database stores data relating to time instances. It offers temporal data types and stores information relating to past, present and future time. Temporal databases could be uni-temporal, bi-temporal or tri-temporal. More specifically th ...
. Since a ''current database'' stores only ''currently'' valid data, there is a one-to-one correspondence between a surrogate in the modeled world and the primary key of the database. In this case the surrogate may be used as a primary key, resulting in the term ''surrogate key''. In a temporal database, however, there is a many-to-one relationship between primary keys and the surrogate. Since there may be several objects in the database corresponding to a single surrogate, we cannot use the surrogate as a primary key; another attribute is required, in addition to the surrogate, to uniquely identify each object. Although Hall ''et al.'' (1976) say nothing about this, others have argued that a surrogate should have the following characteristics: * the value is never reused * the value is system generated * the value is not manipulable by the user or application * the value contains no semantic meaning * the value is not visible to the user or application * the value is not composed of several values from different domains.


Surrogates in practice

In a current database, the surrogate key can be the
primary key In the relational model of databases, a primary key is a ''specific choice'' of a ''minimal'' set of attributes (Column (database), columns) that uniquely specify a tuple (Row (database), row) in a Relation (database), relation (Table (database), t ...
, generated by the
database management system In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases span ...
and ''not'' derived from any application data in the database. The only significance of the surrogate key is to act as the primary key. It is also possible that the surrogate key exists in addition to the database-generated
UUID A universally unique identifier (UUID) is a 128-bit label used for information in computer systems. The term globally unique identifier (GUID) is also used. When generated according to the standard methods, UUIDs are, for practical purposes, un ...
(for example, an HR number for each employee other than the UUID of each employee). A surrogate key is frequently a sequential number (e.g. a Sybase or SQL Server "identity column", a
PostgreSQL PostgreSQL (, ), also known as Postgres, is a free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance. It was originally named POSTGRES, referring to its origins as a successor to the In ...
or
Informix IBM Informix is a product family within IBM's Information Management division that is centered on several relational database management system (RDBMS) offerings. The Informix products were originally developed by Informix Corporation, whose ...
serial, an
Oracle An oracle is a person or agency considered to provide wise and insightful counsel or prophetic predictions, most notably including precognition of the future, inspired by deities. As such, it is a form of divination. Description The word '' ...
or SQL Server SEQUENCE or a column defined with AUTO_INCREMENT in
MySQL MySQL () is an open-source relational database management system (RDBMS). Its name is a combination of "My", the name of co-founder Michael Widenius's daughter My, and "SQL", the acronym for Structured Query Language. A relational database o ...
). Some databases provide
UUID A universally unique identifier (UUID) is a 128-bit label used for information in computer systems. The term globally unique identifier (GUID) is also used. When generated according to the standard methods, UUIDs are, for practical purposes, un ...
/
GUID A universally unique identifier (UUID) is a 128-bit label used for information in computer systems. The term globally unique identifier (GUID) is also used. When generated according to the standard methods, UUIDs are, for practical purposes, uni ...
as a possible data type for surrogate keys (e.g
PostgreSQL UUID
o
SQL Server UNIQUEIDENTIFIER
. Having the key independent of all other columns insulates the database relationships from changes in data values or database design (making the database more agile) and guarantees uniqueness. In a
temporal database A temporal database stores data relating to time instances. It offers temporal data types and stores information relating to past, present and future time. Temporal databases could be uni-temporal, bi-temporal or tri-temporal. More specifically th ...
, it is necessary to distinguish between the surrogate key and the business key. Every row would have both a business key and a surrogate key. The surrogate key identifies one unique row in the database, the business key identifies one unique entity of the modeled world. One table row represents a slice of time holding all the entity's attributes for a defined timespan. Those slices depict the whole lifespan of one business entity. For example, a table ''EmployeeContracts'' may hold temporal information to keep track of contracted working hours. The business key for one contract will be identical (non-unique) in both rows however the surrogate key for each row is unique. Some database designers use surrogate keys systematically regardless of the suitability of other
candidate key A candidate key, or simply a key, of a relational database is a minimal superkey. In other words, it is any set of columns that have a unique combination of values in each row (which makes it a superkey), with the additional constraint that removi ...
s, while others will use a key already present in the data, if there is one. Some of the alternate names ("system-generated key") describe the way of ''generating'' new surrogate values rather than the ''nature'' of the surrogate concept. Approaches to generating surrogates include: *
Universally Unique Identifier A universally unique identifier (UUID) is a 128-bit nominal number, label used for information in computer systems. The term globally unique identifier (GUID) is also used. When generated according to the standard methods, UUIDs are, for practic ...
s (UUIDs) *
Globally Unique Identifier A universally unique identifier (UUID) is a 128-bit label used for information in computer systems. The term globally unique identifier (GUID) is also used. When generated according to the standard methods, UUIDs are, for practical purposes, un ...
s (GUIDs) *
Object Identifier In computing, object identifiers or OIDs are an identifier mechanism standardized by the International Telecommunication Union (ITU) and ISO/IEC for naming any object, concept, or "thing" with a globally unambiguous persistent name. Syntax and le ...
s (OIDs) * Sybase or SQL Server identity column IDENTITY OR IDENTITY(n,n) *
Oracle An oracle is a person or agency considered to provide wise and insightful counsel or prophetic predictions, most notably including precognition of the future, inspired by deities. As such, it is a form of divination. Description The word '' ...
SEQUENCE, or GENERATED AS IDENTITY (starting from version 12.1) * SQL Server SEQUENCE (starting from SQL Server 2012) *
PostgreSQL PostgreSQL (, ), also known as Postgres, is a free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance. It was originally named POSTGRES, referring to its origins as a successor to the In ...
or
IBM Informix IBM Informix is a product family within IBM's Information Management division that is centered on several relational database management system (RDBMS) offerings. The Informix products were originally developed by Informix Corporation, whose ...
serial *
MySQL MySQL () is an open-source relational database management system (RDBMS). Its name is a combination of "My", the name of co-founder Michael Widenius's daughter My, and "SQL", the acronym for Structured Query Language. A relational database o ...
AUTO_INCREMENT *
SQLite SQLite (, ) is a database engine written in the C programming language. It is not a standalone app; rather, it is a library that software developers embed in their apps. As such, it belongs to the family of embedded databases. It is the m ...
INTEGER PRIMARY KEY (if AUTOINCREMENT is used it will prevent the reuse of numbers that have already been used but are available) * AutoNumber data type in
Microsoft Access Microsoft Access is a database management system (DBMS) from Microsoft that combines the relational Access Database Engine (ACE) with a graphical user interface and software-development tools (not to be confused with the old Microsoft Access ...
* AS IDENTITY GENERATED BY DEFAULT in
IBM Db2 Db2 is a family of data management products, including database servers, developed by IBM. It initially supported the relational model, but was extended to support object–relational features and non-relational structures like JSON a ...
&
PostgreSQL PostgreSQL (, ), also known as Postgres, is a free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance. It was originally named POSTGRES, referring to its origins as a successor to the In ...
. * Identity column (implemented in DDL) in
Teradata Teradata Corporation is an American software company that provides cloud database and analytics-related software, products, and services. The company was formed in 1979 in Brentwood, California, as a collaboration between researchers at Caltech ...
* Table Sequence when the sequence is calculated by a procedure and a sequence table with fields: id, sequenceName, sequenceValue and incrementValue


Advantages


Stability

Surrogate keys typically do not change while the row exists. This has the following advantages: * Applications cannot lose their reference to a row in the database (since the identifier does not change). * The primary or natural key data can always be modified, even with databases that do not support cascading updates across related
foreign keys ''Foreign Keys'' is the second album released in 1985, 11th album overall, by musician Jandek, and his eleventh overall. This is the first Jandek album featuring a full band, and is without acoustic numbers. It is also an album split between tra ...
.


Requirement changes

Attributes that uniquely identify an entity might change, which might invalidate the suitability of natural keys. Consider the following example: :An employee's network user name is chosen as a natural key. Upon merging with another company, new employees must be inserted. Some of the new network user names create conflicts because their user names were generated independently (when the companies were separate). In these cases, generally a new attribute must be added to the natural key (for example, an ''original_company'' column). With a surrogate key, only the table that defines the surrogate key must be changed. With natural keys, all tables (and possibly other, related software) that use the natural key will have to change. Some problem domains do not clearly identify a suitable natural key. Surrogate keys avoid choosing a natural key that might be incorrect.


Performance

Surrogate keys tend to be a compact data type, such as a four-byte integer. This allows the database to query the single key column faster than it could multiple columns. Furthermore, a non-redundant distribution of keys causes the resulting
b-tree In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for ...
index to be completely balanced. Surrogate keys are also less expensive to join (fewer columns to compare) than
compound key {{Unreferenced, date=October 2020 In database design, a composite key is a candidate key that consists of two or more attributes (table columns) that together uniquely identify an entity occurrence (table row). A compound key is a composite key f ...
.


Compatibility

While using several database application development systems, drivers, and object-relational mapping systems, such as
Ruby on Rails Ruby on Rails (simplified as Rails) is a server-side web application framework written in Ruby under the MIT License. Rails is a model–view–controller (MVC) framework, providing default structures for a database, a web service, and we ...
or
Hibernate Hibernation is a state of minimal activity and metabolic depression undergone by some animal species. Hibernation is a seasonal heterothermy characterized by low body-temperature, slow breathing and heart-rate, and low metabolic rate. It most ...
, it is much easier to use an integer or GUID surrogate keys for every table instead of natural keys in order to support database-system-agnostic operations and object-to-row mapping.


Uniformity

When every table has a uniform surrogate key, some tasks can be easily automated by writing the code in a table-independent way.


Validation

It is possible to design key-values that follow a well-known pattern or structure which can be automatically verified. For instance, the keys that are intended to be used in some column of some table might be designed to "look differently from" those that are intended to be used in another column or table, thereby simplifying the detection of application errors in which the keys have been misplaced. However, this characteristic of the surrogate keys should never be used to drive any of the logic of the applications themselves, as this would violate the principles of
Database normalization Database normalization or database normalisation (see spelling differences) is the process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity ...
.


Disadvantages


Disassociation

The values of generated surrogate keys have no relationship to the real-world ''meaning'' of the data held in a row. When inspecting a row holding a foreign key reference to another table using a surrogate key, the meaning of the surrogate key's row cannot be discerned from the key itself. Every foreign key must be joined to see the related data item. If appropriate database constraints have not been set, or data imported from a legacy system where
referential integrity Referential integrity is a property of data stating that all its references are valid. In the context of relational databases, it requires that if a value of one attribute (column) of a relation (table) references a value of another attribute (ei ...
was not employed, it is possible to have a foreign-key value that does not correspond to a primary-key value and is therefore invalid. (In this regard,
C.J. Date Chris Date (born 1941) is an independent author, lecturer, researcher, and consultant, specializing in relational database theory. Biography Chris Date attended High Wycombe Royal Grammar School (U.K.) from 1951 to 1958 and received his BA i ...
regards the meaninglessness of surrogate keys as an advantage. C.J. Date. The primacy of primary keys. From "Relational Database Writings, 1991-1994. Addison-Wesley, Reading, MA. ) To discover such errors, one must perform a query that uses a left
outer join A join clause in SQL – corresponding to a join operation in relational algebra – combines columns from one or more tables into a new table. Informally, a join stitches two tables and puts on the same row records with matching fields : INNER, ...
between the table with the foreign key and the table with the primary key, showing both key fields in addition to any fields required to distinguish the record; all invalid foreign-key values will have the primary-key column as NULL. The need to perform such a check is so common that Microsoft Access actually provides a "Find Unmatched Query" wizard that generates the appropriate SQL after walking the user through a dialog. (It is, however, not too difficult to compose such queries manually.) "Find Unmatched" queries are typically employed as part of a
data cleansing Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the dat ...
process when inheriting legacy data. Surrogate keys are unnatural for data that is exported and shared. A particular difficulty is that tables from two otherwise identical schemas (for example, a test schema and a development schema) can hold records that are equivalent in a business sense, but have different keys. This can be mitigated by NOT exporting surrogate keys, except as transient data (most obviously, in executing applications that have a "live" connection to the database). When surrogate keys supplant natural keys, then domain specific
referential integrity Referential integrity is a property of data stating that all its references are valid. In the context of relational databases, it requires that if a value of one attribute (column) of a relation (table) references a value of another attribute (ei ...
will be compromised. For example, in a customer master table, the same customer may have multiple records under separate customer IDs, even though the natural key (a combination of customer name, date of birth, and e-mail address) would be unique. To prevent compromise, the natural key of the table must NOT be supplanted: it must be preserved as a unique constraint, which is implemented as a unique index on the combination of natural-key fields.


Query optimization

Relational databases assume a unique
index Index (or its plural form indices) may refer to: Arts, entertainment, and media Fictional entities * Index (''A Certain Magical Index''), a character in the light novel series ''A Certain Magical Index'' * The Index, an item on a Halo megastru ...
is applied to a table's primary key. The unique index serves two purposes: (i) to enforce entity integrity, since primary key data must be unique across rows and (ii) to quickly search for rows when queried. Since surrogate keys replace a table's identifying attributes—the natural key—and since the identifying attributes are likely to be those queried, then the query optimizer is forced to perform a full table scan when fulfilling likely queries. The remedy to the full table scan is to apply indexes on the identifying attributes, or sets of them. Where such sets are themselves a
candidate key A candidate key, or simply a key, of a relational database is a minimal superkey. In other words, it is any set of columns that have a unique combination of values in each row (which makes it a superkey), with the additional constraint that removi ...
, the index can be a unique index. These additional indexes, however, will take up disk space and slow down inserts and deletes.


Normalization

Surrogate keys can result in duplicate values in any natural keys. To prevent duplication, one must preserve the role of the natural keys as unique constraints when defining the table using either SQL's CREATE TABLE statement or ALTER TABLE ... ADD CONSTRAINT statement, if the constraints are added as an afterthought.


Business process modeling

Because surrogate keys are unnatural, flaws can appear when modeling the business requirements. Business requirements, relying on the natural key, then need to be translated to the surrogate key. A strategy is to draw a clear distinction between the logical model (in which surrogate keys do not appear) and the physical implementation of that model, to ensure that the logical model is correct and reasonably well normalised, and to ensure that the physical model is a correct implementation of the logical model.


Inadvertent disclosure

Proprietary information can be leaked if surrogate keys are generated sequentially. By subtracting a previously generated sequential key from a recently generated sequential key, one could learn the number of rows inserted during that time period. This could expose, for example, the number of transactions or new accounts per period. For example see
German tank problem In the statistical theory of estimation theory, estimation, the German tank problem consists of estimating the maximum of a discrete uniform distribution from sampling without replacement. In simple terms, suppose there exists an unknown number o ...
. There are a few ways to overcome this problem: * increase the sequential number by a random amount; * generate a random key such as a
UUID A universally unique identifier (UUID) is a 128-bit label used for information in computer systems. The term globally unique identifier (GUID) is also used. When generated according to the standard methods, UUIDs are, for practical purposes, un ...
.


Inadvertent assumptions

Sequentially generated surrogate keys can imply that events with a higher key value occurred after events with a lower value. This is not necessarily true, because such values do not guarantee time sequence as it is possible for inserts to fail and leave gaps which may be filled at a later time. If chronology is important then date and time must be separately recorded.


See also

* Natural key *
Object identifier In computing, object identifiers or OIDs are an identifier mechanism standardized by the International Telecommunication Union (ITU) and ISO/IEC for naming any object, concept, or "thing" with a globally unambiguous persistent name. Syntax and le ...
*
Persistent Object Identifier In database design, a Persistent Object Identifier (POID) is a unique identifier of a record on a table, used as the primary key. Important characteristics of a POID are that it does not carry business information and are not generally exported or ...


References


Citations


Sources

* * Engles, R.W.: (1972), ''A Tutorial on Data-Base Organization'', Annual Review in Automatic Programming, Vol.7, Part 1, Pergamon Press, Oxford, pp. 1–64. * Langefors, B (1968). ''Elementary Files and Elementary File Records'', Proceedings of File 68, an IFIP/IAG International Seminar on File Organisation, Amsterdam, November, pp. 89–96. * * * * * {{DEFAULTSORT:Surrogate Key Data modeling zh:关系键#.E4.BB.A3.E7.90.86.E9.8D.B5